Agenda

Why do we need new APIs? 

What's new? 

• Parallelism 

• Explicit Memory Management 

• PSOs and Descriptor Sets 


Best Practices 



GAME DEVELOPERS CONFERENCE EUROPE 2015, AUGUST 3-4, 2015


GDCEUROPE.COM


Evolution of 3D Graphics APIs 

• Software rasterization 

• 1996 Glide: Bilinear filtering + Transparency 

• 1998 DirectX 6: IHV independent + Multitexturing 

• 1999 DirectX 7: Hardware Transform & Lighting + Cube Maps

• 2000 DirectX 8: Programmable shaders + Tessellation 

• 2002 DirectX 9: More complex shaders 

• 2006 DirectX 10: Unified shader Model 

• 2009 DirectX 11: Compute 
GPU performance is increasing faster than single-core CPU performance
How to get GPU bound

• Optimize API usage (e.g. state caching & sorting) 

• Batch, Batch, Batch!!! 

• ~10k draw calls max 


New APIs:

• Allow multi-threaded command buffer recording

• Reduce workload of runtime/driver

• Reduce runtime validation

• Move work to init/load time (e.g. Pipeline setup)

• More explicit control over the hardware

• Reduce convenience functions
Convenience function: Resource renaming

[Diagram: CPU, GPU, and resource-lifetime timelines for Frame N and Frame N+1]

Examples: 

• Backbuffer 

• Dynamic vertex buffer 

• Dynamic constant buffer 

• Descriptor sets 

• Command buffer 


Track usage by End-Of-Frame-Fence 

• Fences are expensive 

• Use fewer than 10 fences per frame

Best practice for constant buffers: 

• Use system memory (DX12: UPLOAD) 

• Keep it mapped 
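
The renaming that older drivers did behind the scenes can be sketched in plain C++. This is a minimal model under stated assumptions, not D3D12 code: `FrameRingBuffer`, its methods, and the fence values are hypothetical names, and the byte vector only stands in for a persistently mapped UPLOAD buffer. Each frame in flight gets its own region, guarded by a single end-of-frame fence value. The 256-byte default alignment matches the D3D12 constant buffer placement requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a persistently mapped UPLOAD-heap buffer that is
// split into one region per frame in flight (manual "resource renaming").
class FrameRingBuffer {
public:
    // bytesPerFrame should be a multiple of the alignment used in push().
    FrameRingBuffer(std::size_t bytesPerFrame, std::size_t framesInFlight)
        : storage_(bytesPerFrame * framesInFlight),
          bytesPerFrame_(bytesPerFrame),
          regionFence_(framesInFlight, 0) {
        assert(framesInFlight > 0);
    }

    // Selects the region for this frame. Returns false if the GPU has not yet
    // passed the fence value recorded when the region was last used; the
    // caller would then wait on that end-of-frame fence and retry.
    bool beginFrame(std::uint64_t frameIndex, std::uint64_t completedFence) {
        region_ = frameIndex % regionFence_.size();
        if (regionFence_[region_] > completedFence) return false;
        offset_ = 0;
        return true;
    }

    // Copies constants into the current region and returns the absolute byte
    // offset a draw call would bind. Alignment must be a power of two.
    std::size_t push(const void* data, std::size_t size,
                     std::size_t alignment = 256) {
        assert((alignment & (alignment - 1)) == 0);
        const std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        assert(aligned + size <= bytesPerFrame_);
        const std::size_t absolute = region_ * bytesPerFrame_ + aligned;
        std::memcpy(storage_.data() + absolute, data, size);
        offset_ = aligned + size;
        return absolute;
    }

    // Records the one end-of-frame fence value that guards this region.
    void endFrame(std::uint64_t fenceValue) { regionFence_[region_] = fenceValue; }

private:
    std::vector<unsigned char> storage_;      // never unmapped, never reallocated
    std::size_t bytesPerFrame_;
    std::vector<std::uint64_t> regionFence_;  // last fence value per region
    std::size_t region_ = 0;
    std::size_t offset_ = 0;
};
```

With two frames in flight, frame 2 reuses the region of frame 0 and may only start writing once the fence signalled at the end of frame 0 has completed. One fence per frame covers every allocation made from the region, which keeps the fence count low.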
2013: AMD Mantle

"Mantle is not for everyone"

Adopted by several developers & titles

• Developers are willing to do the additional work

• Significant performance improvements in games 

• Good ISVs don't need runtime validation 


Only available on AMD GCN hardware 
Needed standardization 
2013: AMD Mantle

"Mantle is not a low level API"

• It's a "just the right level API"

• Support different HW configurations

• Discrete GPU vs. Integrated

• Shaders & command buffer are HW specific 

• Support different HW generations 

• Think about future hardware 

• On PC, your title is never alone 
Next Generation API features

DirectX 12 & Vulkan share the Mantle philosophy:

• Minimize overhead 

• Minimize runtime validation 

• Allow multithreaded command buffer recording 

• Provide low level memory management 

• Support multiple asynchronous queues 

• Provide explicit access to multiple devices 
The big question: "How much performance will I gain from porting my code to new APIs?"

No magic involved! 

Depends on the current bottlenecks 

Depends a lot on engine design 

• Need to utilize new possibilities 

• It might "just work" (esp. if heavily CPU limited)

• Might need redesign of engine (and asset pipeline) 
(Designed around API commands)
Think Parallel!

Keep the GPU busy 
CPU side multithreading

• Multi-threaded command buffer building

• Submission to queue is not thread safe 

• Split frame into macro render jobs 

• Offload shader compilation from main thread 

• Batch command buffer submission 

• Don't stall during submit/present 
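
The recording and submission pattern from the list above can be sketched with standard C++ threads. Everything here is a hypothetical stand-in rather than a real graphics API: `CommandBuffer` is just a list of strings, `Queue::submit` models a queue that must only be called from one thread, and the job names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a command buffer is a list of recorded commands,
// and a queue accepts batches of finished command buffers.
using CommandBuffer = std::vector<std::string>;

struct Queue {
    std::vector<CommandBuffer> submitted;
    std::size_t submitCalls = 0;

    // Not thread safe by design: call from a single thread only.
    void submit(std::vector<CommandBuffer>& batch) {
        ++submitCalls;
        for (auto& cb : batch) submitted.push_back(std::move(cb));
        batch.clear();
    }
};

// Records one command buffer per macro render job on its own thread, then
// submits all of them in a single batched call from the calling thread.
inline void recordAndSubmit(const std::vector<std::string>& jobs, Queue& queue) {
    std::vector<CommandBuffer> buffers(jobs.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        // Each worker writes only to its own buffer, so recording needs no lock.
        workers.emplace_back([&buffers, &jobs, i] {
            buffers[i].push_back("begin " + jobs[i]);
            buffers[i].push_back("draw " + jobs[i]);
            buffers[i].push_back("end " + jobs[i]);
        });
    }
    for (auto& w : workers) w.join();
    queue.submit(buffers);  // one submission, buffers kept in job order
}

// Usage sketch: three macro jobs, one submit call.
inline void demoRecordAndSubmit() {
    Queue queue;
    recordAndSubmit({"shadow", "gbuffer", "lighting"}, queue);
    assert(queue.submitCalls == 1);
    assert(queue.submitted.size() == 3);
}
```

A real engine would reuse a worker pool instead of spawning threads every frame; the sketch only shows that recording scales across threads while submission stays serialized and batched.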
GPU side multithreading



Compute Units 
64 CU

x 4 SIMD per CU 
x 10 Wavefronts per SIMD 
x 64 Threads per Wavefront 


Up to 163840 threads 
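
The thread count above is a plain product, which can be checked at compile time. The struct and function names below are illustrative assumptions; only the four factors come from the slide. The second helper shows why small draws leave such a GPU mostly idle.

```cpp
#include <cassert>
#include <cstdint>

// The four factors from the slide; field names are illustrative.
struct GpuConfig {
    std::uint32_t computeUnits;
    std::uint32_t simdsPerCu;
    std::uint32_t wavefrontsPerSimd;
    std::uint32_t threadsPerWavefront;
};

// Maximum number of threads the GPU can keep in flight at once.
constexpr std::uint32_t maxThreadsInFlight(const GpuConfig& g) {
    return g.computeUnits * g.simdsPerCu * g.wavefrontsPerSimd *
           g.threadsPerWavefront;
}

constexpr GpuConfig kSlideConfig{64, 4, 10, 64};
static_assert(maxThreadsInFlight(kSlideConfig) == 163840,
              "64 x 4 x 10 x 64 = 163840");

// Number of wavefronts needed to cover a workload of `threads` threads
// (rounded up, since a partially filled wavefront still occupies a slot).
inline std::uint32_t wavefrontsFor(std::uint32_t threads, const GpuConfig& g) {
    assert(g.threadsPerWavefront > 0);
    return (threads + g.threadsPerWavefront - 1) / g.threadsPerWavefront;
}
```

For example, a draw with 100 vertices needs only 2 of the 2560 available wavefront slots (64 x 4 x 10), so many such draws have to overlap before the machine is full.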


Batch, Batch, Batch!
















GPU: Single graphics queue 



Multiple commands can execute in parallel 

• Pipeline (usually) must maintain pixel order 

• Load balancing is the main problem 
Explicit Barriers & Transitions

• Indicate RaW/WaW Hazards 

• Switch resource state between RO/RW/WO 

• Decompress DepthStencil/RTs 

• May cause a stall or cache flush 

• Batch them! 

• Split Barriers may help in the future 

• Always execute them on the last queue that wrote the resource


Most common cause for bugs! 
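
One way to follow the "Batch them!" advice is to collect transitions while a pass is being set up and emit them in one call. The sketch below is a simplified model, not an API binding: `BarrierBatcher`, the integer resource ids, and the assumption that every resource starts read-only are all hypothetical. It tracks state transitions only; hazards between two accesses in the same state (for example two read-write passes) are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// The three states named on the slide: RO / RW / WO.
enum class ResourceState { ReadOnly, ReadWrite, WriteOnly };

struct Transition {
    int resource;  // hypothetical resource id
    ResourceState before;
    ResourceState after;
};

// Tracks the current state per resource, queues only the transitions that
// actually change state, and hands them out as one batch.
class BarrierBatcher {
public:
    void require(int resource, ResourceState wanted) {
        auto it = state_.find(resource);
        // Assumption: resources not seen before start out read-only.
        const ResourceState current =
            (it == state_.end()) ? ResourceState::ReadOnly : it->second;
        if (current == wanted) return;  // state unchanged, nothing to queue
        pending_.push_back({resource, current, wanted});
        state_[resource] = wanted;
    }

    // Returns everything queued so far for a single barrier call.
    std::vector<Transition> flush() {
        std::vector<Transition> batch;
        batch.swap(pending_);
        ++flushCount_;
        return batch;
    }

    std::size_t flushCount() const { return flushCount_; }

private:
    std::unordered_map<int, ResourceState> state_;
    std::vector<Transition> pending_;
    std::size_t flushCount_ = 0;
};

// Usage sketch: three requests, two real transitions, one flush.
inline void demoBarrierBatcher() {
    BarrierBatcher barriers;
    barriers.require(1, ResourceState::ReadWrite);
    barriers.require(2, ResourceState::WriteOnly);
    barriers.require(1, ResourceState::ReadWrite);  // redundant, dropped
    assert(barriers.flush().size() == 2);
    assert(barriers.flushCount() == 1);
}
```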
GPU: Barriers

Hard to detect Barriers in DX11
Explicit in DX12

GPU: Barriers



Time Query 


Batch them! 

[DX12] In the future split barriers may help 
GPU underutilization



Culling can cause bubbles of inactivity. 

Fetch latency is a common cause for underutilization 
Multiple Queues

• Let the driver know about independent workloads

• Each queue type is a superset of the next: graphics > compute > copy

• Multiple queues per type 

• Specify type at record time 

• Parallel execution 

• Sync using fences 

• Shared GPU resources 
Asynchronous Compute

Bus dominated:

• Shadow mapping

• ROP heavy workloads

• G-buffer operations

• DMA operations

- Texture upload

- Heap defrag

Shader throughput:

• Deferred lighting

• Postprocessing effects

• Most compute tasks

- Texture compression

- Physics

- Simulations

Geometry dominated:

• Rendering highly detailed models


Multiple queues let you specify tasks that can execute in parallel
Schedule different bottlenecks together to improve efficiency
Explicit MGPU

DirectX 11 only supports one device

• CF/SLI support is essentially a driver hack

• Increases latency 

Explicit MGPU allows 

• Split Frame Rendering 

• Master/Slave configurations 


• 3D/VR rendering using 2 dGPUs 
Take Control!


Explicit Memory Management

Explicit Memory Management

• Control over heaps and residency 

• Abstraction for different architectures 

• VMM still exists 

• Use Evict/MakeResident to page out unused resources

• Avoid oversubscribing resident memory! 
Explicit Memory Management

Render targets & UAVs

• Create in DEFAULT 

Textures 

• Write to UPLOAD 

• Use copy queue to copy to DEFAULT 

• The copy also swizzles the texture: required even on iGPU!

Buffers (CB/VB/IB) 

• Placement dependent on usage: 

• Write once/Read once => UPLOAD 

• Write once/Read many => Copy to DEFAULT 
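
The placement rules on this slide fit in a small decision function. This is only a sketch of the rules as stated here; the enum and struct names are hypothetical and do not correspond to D3D12 types.

```cpp
#include <cassert>

enum class HeapType { Upload, Default };
enum class ResourceKind { RenderTargetOrUav, Texture, Buffer };

struct Placement {
    HeapType finalHeap;       // heap the GPU reads the resource from
    bool stageThroughUpload;  // true: write to UPLOAD first, then copy
};

// Encodes the slide's rules. readManyTimes only matters for buffers.
constexpr Placement choosePlacement(ResourceKind kind, bool readManyTimes) {
    return (kind == ResourceKind::RenderTargetOrUav)
               ? Placement{HeapType::Default, false}  // create in DEFAULT
           : (kind == ResourceKind::Texture)
               ? Placement{HeapType::Default, true}   // UPLOAD, copy to DEFAULT
           : readManyTimes
               ? Placement{HeapType::Default, true}   // write once / read many
               : Placement{HeapType::Upload, false};  // write once / read once
}

// Usage sketch covering each rule once.
inline void demoPlacement() {
    assert(choosePlacement(ResourceKind::RenderTargetOrUav, false).finalHeap ==
           HeapType::Default);
    assert(choosePlacement(ResourceKind::Texture, false).stageThroughUpload);
    assert(choosePlacement(ResourceKind::Buffer, false).finalHeap ==
           HeapType::Upload);
    assert(choosePlacement(ResourceKind::Buffer, true).finalHeap ==
           HeapType::Default);
}
```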
Direct3D 12 Resource Creation APIs

Explicit Memory Management

Don't over-allocate committed memory

• Share L1 with Windows and other processes

• Don't allocate more than 80% 

• Reduce memory footprint 

• Use placed resources to reduce overhead 

• Use reserved resources as PRT 

Allocate most important resources first 

Group resources used together in same heap 

• Use MakeResident/Evict 
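
A rough sketch of "allocate most important resources first" combined with the 80% cap: requests are sorted by priority and placed in local memory until the cap is reached. The struct, the priority convention, and the greedy fallback are assumptions made for illustration; a real engine would query the actual budget from the OS and group resources into heaps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct ResourceRequest {
    std::string name;
    std::uint64_t bytes;
    int priority;        // higher = more important (assumed convention)
    bool inLocalMemory;  // filled in by planResidency
};

// Sorts requests by descending priority and places each one in local (video)
// memory if it still fits under 80% of localBytes. Requests that do not fit
// are left for system memory; later, smaller requests may still fit.
// Returns the number of bytes placed in local memory.
inline std::uint64_t planResidency(std::vector<ResourceRequest>& requests,
                                   std::uint64_t localBytes) {
    const std::uint64_t cap = localBytes / 10 * 8;  // the 80% rule
    std::stable_sort(requests.begin(), requests.end(),
                     [](const ResourceRequest& a, const ResourceRequest& b) {
                         return a.priority > b.priority;
                     });
    std::uint64_t used = 0;
    for (auto& r : requests) {
        r.inLocalMemory = (used + r.bytes <= cap);
        if (r.inLocalMemory) used += r.bytes;
    }
    return used;
}

// Usage sketch: 1000 bytes of local memory gives a cap of 800 bytes.
inline void demoPlanResidency() {
    std::vector<ResourceRequest> requests = {
        {"ui", 100, 1, false},
        {"gbuffer", 500, 10, false},
        {"textures", 400, 5, false},
    };
    const std::uint64_t used = planResidency(requests, 1000);
    assert(used == 600);                 // gbuffer + ui
    assert(requests[0].name == "gbuffer");
    assert(!requests[1].inLocalMemory);  // textures would exceed the cap
}
```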
Avoid Redundancy!

Organize your pipelines 
Pipeline State Objects

Full pipeline optimization

• Simplifies optimization

Additional information at startup

• Shaders

• Raster states

• Static constants

Build a pipeline cache

• No pre-warming

Most engines are not designed for monolithic pipelines
Descriptor Sets

Old APIs: 

• Single resource binding 

• A lot of work for the driver to track, validate and manage resource bindings

• Data management scripting language style 

New APIs: 

• Group resources in descriptor sets 

• Pipelines contain "pointers"

• Data management C/C++ style 
Resource Binding

Table driven

Shared across all shader stages

Two-level table

- Root Signature describes a top-level layout 

• Pointers to descriptor tables 

• Direct pointers to constant buffers 

• Inline constants 

Changing which table is pointed to is cheap 

- It's just writing a pointer 

- No synchronisation cost

Changing contents of table is harder

- Can't change table in flight on the hardware

- No automatic renaming 
PSO: Best Practices

Use Shader and Pipeline cache

• Avoid duplication

Sort draw calls by PSO used

• Sort by Tessellation/GS enabled/disabled

Keep Root Descriptor small

• Group DescriptorSets by update pattern

Sort Root entries by frequency of update

• Most frequently changed entries first 
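
Sorting draw calls by PSO can be sketched as an ordinary sort on a composite key. The `DrawCall` fields and integer ids are hypothetical; a real engine would typically pack them into a single sort key. Counting PSO changes before and after sorting shows the effect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

struct DrawCall {
    std::uint32_t pso;              // hypothetical pipeline state id
    std::uint32_t descriptorTable;  // hypothetical descriptor table id
    std::uint32_t mesh;
};

// Orders draws by PSO first, then by descriptor table, so that consecutive
// draws share as much bound state as possible.
inline void sortByPso(std::vector<DrawCall>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return std::tie(a.pso, a.descriptorTable) <
                                std::tie(b.pso, b.descriptorTable);
                     });
}

// Number of times the PSO has to be (re)bound when drawing in list order.
inline std::size_t countPsoSwitches(const std::vector<DrawCall>& draws) {
    std::size_t switches = 0;
    for (std::size_t i = 0; i < draws.size(); ++i) {
        if (i == 0 || draws[i].pso != draws[i - 1].pso) ++switches;
    }
    return switches;
}

// Usage sketch: five draws using three PSOs in interleaved order.
inline void demoSortByPso() {
    std::vector<DrawCall> draws = {
        {2, 0, 10}, {1, 0, 11}, {2, 1, 12}, {1, 1, 13}, {3, 0, 14}};
    assert(countPsoSwitches(draws) == 5);
    sortByPso(draws);
    assert(countPsoSwitches(draws) == 3);
}
```

The sort is stable, so draws that share PSO and descriptor table keep their submission order.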
Top 5 Performance Advice

#5. Avoid allocation/release at runtime

#4. Don't oversubscribe!

Manage your memory efficiently

#3. Batch, Batch, Batch!

Group barriers, group command buffer submissions

#2. Think Parallel!

On CPU as well as GPU

#1. Old optimization recommendations still apply
Thank you!


Contact: Stephan.Hodes@amd.com 

@Highflz 



